PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds
Garment pattern design aims to convert a 3D garment to the corresponding 2D
panels and their sewing structure. Existing methods rely either on template
fitting with heuristics and prior assumptions, or on model learning with
complicated shape parameterization. Importantly, neither approach allows
personalization of the output garment, for which demand is growing. To meet
this demand, we introduce PersonalTailor: a personalized 2D pattern
design method, where the user can input specific constraints or demands (in
language or sketch) for personal 2D panel fabrication from 3D point clouds.
PersonalTailor first learns multi-modal panel embeddings based on
unsupervised cross-modal association and attentive fusion. It then predicts
binary panel masks individually using a transformer encoder-decoder framework.
Extensive experiments show that our PersonalTailor excels on both personalized
and standard pattern fabrication tasks.
Comment: Technical Report
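As a rough illustration of the cross-modal attentive-fusion step, the sketch below weights each panel embedding by its scaled dot-product similarity to the user's language/sketch query embedding. The function names, shapes, and the simple additive fusion rule are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_fusion(panel_emb, query_emb):
    """Fuse per-panel point-cloud embeddings with a user query embedding.

    panel_emb: (P, D) embeddings of P candidate panels
    query_emb: (D,)   embedding of the language/sketch constraint

    Attention weights come from scaled dot-product similarity; the fused
    embedding mixes each panel with the query in proportion to how
    strongly the query attends to it. A hypothetical sketch only.
    """
    d = panel_emb.shape[1]
    attn = softmax(panel_emb @ query_emb / np.sqrt(d))        # (P,)
    fused = panel_emb + attn[:, None] * query_emb[None, :]    # (P, D)
    return fused, attn
```

The binary panel masks would then be predicted per panel from such fused embeddings by the transformer encoder-decoder.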
Semi-Supervised Temporal Action Detection with Proposal-Free Masking
Existing temporal action detection (TAD) methods rely on a large number of
training data with segment-level annotations. Collecting and annotating such a
training set is thus highly expensive and unscalable. Semi-supervised TAD
(SS-TAD) alleviates this problem by leveraging unlabeled videos freely
available at scale. However, SS-TAD is also a much more challenging problem
than supervised TAD, and consequently much less studied. Prior SS-TAD methods
directly combine an existing proposal-based TAD method and an SSL method. Due to
their sequential localization (e.g., proposal generation) and classification
design, they are prone to proposal error propagation. To overcome this
limitation, in this work we propose a novel Semi-supervised Temporal action
detection model based on PropOsal-free Temporal mask (SPOT) with a parallel
localization (mask generation) and classification architecture. Such a novel
design effectively eliminates the dependence between localization and
classification by cutting off the route for error propagation in-between. We
further introduce an interaction mechanism between classification and
localization for prediction refinement, and a new pretext task for
self-supervised model pre-training. Extensive experiments on two standard
benchmarks show that our SPOT outperforms state-of-the-art alternatives, often
by a large margin. The PyTorch implementation of SPOT is available at
https://github.com/sauradip/SPOT
Comment: ECCV 2022; Code available at https://github.com/sauradip/SPOT
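A minimal sketch of how a proposal-free, mask-based design can decode action instances: localization (thresholding a snippet-level foreground mask) and classification (averaging per-snippet class scores) come from parallel branches, so there is no proposal stage for errors to propagate through. The function, shapes, and decoding rule are assumptions for illustration, not SPOT's exact procedure.

```python
import numpy as np

def decode_segments(mask, cls_scores, thr=0.5):
    """Decode action instances from parallel branch outputs.

    mask:       (T,)   foreground probability per snippet (localization)
    cls_scores: (T, C) class probabilities per snippet (classification)
    Returns a list of (start, end, class_id) over snippet indices.
    """
    fg = mask >= thr
    segments, t, T = [], 0, len(mask)
    while t < T:
        if fg[t]:
            s = t
            while t < T and fg[t]:        # grow contiguous foreground run
                t += 1
            # classify the segment from its averaged class scores
            cls_id = int(cls_scores[s:t].mean(axis=0).argmax())
            segments.append((s, t, cls_id))
        else:
            t += 1
    return segments
```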
Post-Processing Temporal Action Detection
Existing Temporal Action Detection (TAD) methods typically take a
pre-processing step in converting an input varying-length video into a
fixed-length snippet representation sequence, before temporal boundary
estimation and action classification. This pre-processing step would temporally
downsample the video, reducing the inference resolution and hampering the
detection performance in the original temporal resolution. In essence, this is
due to a temporal quantization error introduced during the resolution
downsampling and recovery. This could negatively impact the TAD performance,
but is largely ignored by existing methods. To address this problem, in this
work we introduce a novel model-agnostic post-processing method without model
redesign and retraining. Specifically, we model the start and end points of
action instances with a Gaussian distribution for enabling temporal boundary
inference at a sub-snippet level. We further introduce an efficient
Taylor-expansion based approximation, dubbed as Gaussian Approximated
Post-processing (GAP). Extensive experiments demonstrate that our GAP can
consistently improve a wide variety of pre-trained off-the-shelf TAD models on
the challenging ActivityNet (+0.2% to +0.7% in average mAP) and THUMOS (+0.2%
to +0.5% in average mAP) benchmarks. Such performance gains are already
significant and highly comparable to those achieved by novel model designs.
Also, GAP can be integrated with model training for further performance gain.
Importantly, GAP enables lower temporal resolutions for more efficient
inference, facilitating low-resource applications. The code will be available
at https://github.com/sauradip/GAP
Comment: Technical Report
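The sub-snippet refinement idea can be sketched as follows: if boundary confidence around the discrete peak is approximately Gaussian, its log is quadratic, and a second-order Taylor expansion gives a closed-form fractional offset. This is a generic illustration of the principle, not GAP's exact formulation.

```python
import numpy as np

def refine_boundary(scores, idx):
    """Refine a discrete boundary peak to sub-snippet precision.

    scores: (T,) boundary confidence per snippet; idx: interior argmax.
    Under a Gaussian assumption the log-scores are quadratic near the
    peak, so a Taylor expansion gives the offset -f'(idx) / f''(idx),
    with derivatives estimated by finite differences.
    """
    ln = np.log(np.maximum(scores, 1e-10))
    d1 = (ln[idx + 1] - ln[idx - 1]) / 2.0            # first derivative
    d2 = ln[idx + 1] - 2.0 * ln[idx] + ln[idx - 1]    # second derivative
    return idx - d1 / d2 if d2 != 0 else float(idx)
```

Because the correction needs only three log-score values per boundary, such refinement adds negligible cost on top of any pre-trained model.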
Large-Scale Product Retrieval with Weakly Supervised Representation Learning
Large-scale weakly supervised product retrieval is a practically useful yet
computationally challenging problem. This paper introduces a novel solution for
the eBay Visual Search Challenge (eProduct) held at the Ninth Workshop on
Fine-Grained Visual Categorisation (FGVC9) at CVPR 2022. This
competition presents two challenges: (a) E-commerce is a drastically
fine-grained domain including many products with subtle visual differences; (b)
A lack of target instance-level labels for model training, with only coarse
category labels and product titles available. To overcome these obstacles, we
formulate a strong solution by a set of dedicated designs: (a) Instead of using
text training data directly, we mine thousands of pseudo-attributes from
product titles and use them as the ground truths for multi-label
classification. (b) We incorporate several strong backbones with advanced
training recipes for more discriminative representation learning. (c) We
further introduce a number of post-processing techniques including whitening,
re-ranking and model ensembling for retrieval enhancement. With 71.53% MAR,
our solution "Involution King" took second place on the leaderboard.
Comment: FGVC9 CVPR 2022
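The pseudo-attribute idea can be sketched crudely: tokens that recur across product titles (colours, materials, brands) serve as weak multi-label targets in place of the missing instance-level labels. The tokenisation, threshold, and function names here are assumptions; the actual mining pipeline is more involved.

```python
from collections import Counter

def mine_pseudo_attributes(titles, min_count=2, stopwords=()):
    """Mine frequent title tokens as pseudo-attributes, then tag each
    product with the attributes its title contains, yielding weak
    multi-label classification targets.
    """
    tokens = [[w for w in t.lower().split() if w not in stopwords]
              for t in titles]
    counts = Counter(w for ws in tokens for w in ws)
    # keep only tokens frequent enough to act as shared attributes
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    labels = [sorted({w for w in ws if w in vocab}) for ws in tokens]
    return vocab, labels
```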
DiffSED: Sound Event Detection with Denoising Diffusion
Sound Event Detection (SED) aims to predict the temporal boundaries of all
the events of interest and their class labels, given an unconstrained audio
sample. Taking either the split-and-classify (i.e., frame-level) strategy or the
more principled event-level modeling approach, all existing methods consider
the SED problem from the discriminative learning perspective. In this work, we
reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals
in a denoising diffusion process, conditioned on a target audio sample. During
training, our model learns to reverse the noising process by converting noisy
latent queries to their ground-truth versions within an elegant Transformer
decoder framework. This enables the model to generate accurate event boundaries
from even noisy queries during inference. Extensive experiments on the Urban-SED
and EPIC-Sounds datasets demonstrate that our model significantly outperforms
existing alternatives, with 40+% faster convergence in training.
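The forward (noising) half of such a diffusion formulation has a standard closed form; the sketch below corrupts normalised (start, end) boundary pairs in one step via x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The names and shapes follow generic DDPM conventions, not the paper's code; the learned part (reversing this process) is omitted.

```python
import numpy as np

def forward_noise(boundaries, alpha_bar, rng):
    """One-shot forward (noising) step of DDPM-style diffusion applied
    to normalised (start, end) event boundaries:

        x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps

    Training learns to reverse this corruption; inference then denoises
    random proposals into event boundaries.
    """
    eps = rng.normal(size=boundaries.shape)
    x_t = np.sqrt(alpha_bar) * boundaries + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps
```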
MSQNet: Actor-agnostic Action Recognition with Multi-modal Query
Existing action recognition methods are typically actor-specific due to the
intrinsic topological and apparent differences among the actors. This requires
actor-specific pose estimation (e.g., humans vs. animals), leading to
cumbersome model design complexity and high maintenance costs. Moreover, they
often focus on learning the visual modality alone and single-label
classification whilst neglecting other available information sources (e.g.,
class name text) and the concurrent occurrence of multiple actions. To overcome
these limitations, we propose a new approach called 'actor-agnostic multi-modal
multi-label action recognition,' which offers a unified solution for various
types of actors, including humans and animals. We further formulate a novel
Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object
detection framework (e.g., DETR), characterized by leveraging visual and
textual modalities to represent the action classes better. The elimination of
actor-specific model designs is a key advantage, as it removes the need for
actor pose estimation altogether. Extensive experiments on five publicly
available benchmarks show that our MSQNet consistently outperforms prior
actor-specific alternatives on human and animal single- and multi-label
action recognition tasks by up to 50%. Code will be released at
https://github.com/mondalanindya/MSQNet
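A crude sketch of the multi-modal query idea: each decoder query pairs a class-name text embedding with the clip's pooled visual feature, so queries carry class semantics for every candidate action without any actor-specific design. The additive combination and L2 normalisation here are assumptions for illustration, not MSQNet's actual query construction.

```python
import numpy as np

def build_queries(video_feat, text_embs):
    """Build one decoder query per action class.

    video_feat: (D,)   pooled visual feature of the input clip
    text_embs:  (C, D) text embeddings of the C class names

    Each query mixes a class's text embedding with the shared visual
    context, then is L2-normalised so queries remain comparable.
    """
    q = text_embs + video_feat[None, :]
    return q / np.linalg.norm(q, axis=1, keepdims=True)
```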
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
We propose a new formulation of temporal action detection (TAD) with
denoising diffusion, DiffTAD in short. Taking as input random temporal
proposals, it can yield accurate action proposals given an untrimmed long
video. This presents a generative modeling perspective, in contrast to previous
discriminative learning approaches. This capability is achieved by first diffusing
the ground-truth proposals to random ones (i.e., the forward/noising process)
and then learning to reverse the noising process (i.e., the backward/denoising
process). Concretely, we establish the denoising process in the Transformer
decoder (e.g., DETR) by introducing a temporal location query design with
faster convergence in training. We further propose a cross-step selective
conditioning algorithm for inference acceleration. Extensive evaluations on
ActivityNet and THUMOS show that our DiffTAD achieves top performance compared
to previous alternatives. The code will be made available at
https://github.com/sauradip/DiffusionTAD
Comment: Technical Report